ABSTRACT


This report presents an overview of a program of speech recognition research
which was initiated in 1985 with the major goal of developing techniques for robust
high performance speech recognition under the stress and noise conditions typical of
a military aircraft cockpit. The work on recognition in stress and noise during 1985
and 1986 produced a robust Hidden Markov Model (HMM) isolated-word recognition
(IWR) system with 99 percent speaker-dependent accuracy for several difficult
stress/noise data bases, and very high performance for normal speech. Robustness
techniques which were developed and applied include multi-style training, robust
estimation of parameter variances, perceptually-motivated stress-tolerant distance
measures, use of time-differential speech parameters, and discriminant analysis.
These techniques and others produced more than an order-of-magnitude reduction in
isolated-word recognition error rate relative to a baseline HMM system. An important
feature of the Lincoln HMM system has been the use of continuous-observation
HMM techniques, which provide a good basis for the development of the robustness
techniques, and avoid the need for a vector quantizer at the input to the HMM
system. Beginning in 1987, the robust HMM system has been extended to continuous
speech recognition for both speaker-dependent and speaker-independent tasks.
The robust HMM continuous speech recognizer was integrated in real-time with a
stressing simulated flight task, which was judged to be very realistic by a number of
military pilots. Phrase recognition accuracy on the limited-task-domain (28-word
vocabulary) flight task is better than 99.9 percent. Recently, the robust HMM system
has been extended to large-vocabulary continuous speech recognition, and has
yielded excellent performance in both speaker-dependent and speaker-independent
recognition on the DARPA 1000-word vocabulary resource management data base.
Current efforts include further improvements to the HMM system, techniques for
the integration of speech recognition with natural language processing, and research
on integration of neural network techniques with HMM.
1. INTRODUCTION AND SUMMARY


Since 1985, the Speech Systems Technology Group at Lincoln Laboratory has been carrying
out  a  program  of  speech  recognition  research,  as  part  of  a  multi-laboratory  program  sponsored 
by  the  Defense  Advanced  Research  Projects  Agency  (DARPA).  During  the  first  three  years  of 
the  effort,  the  particular  focus  of  the  Lincoln  program  was  the  development  of  robust  recognition 
techniques  to  cope  with  the  stress  and  noise  conditions  typical  of  the  fighter  cockpit,  but  which 
would  be  applied  in  situations  where  limited  vocabularies  and  constrained  tasks  are  acceptable. 
More  recently,  the  work  has  moved  on  to  the  more  general  problem  of  large-vocabulary  continuous 
speech  recognition  aimed  at  applications  where  more  natural  spoken  language  input  is  required. 

This report presents an overview of the Lincoln speech recognition program from 1985 through
the present. Some highlights of the program are summarized briefly here. The program was
initiated in 1985 with the major goal of developing techniques for robust high performance speech
recognition under the stress and noise conditions typical of a military aircraft cockpit. The work
on recognition in stress and noise during 1985 and 1986 produced a robust Hidden Markov Model
(HMM) isolated-word recognition (IWR) system with 99 percent speaker-dependent accuracy for
several difficult stress/noise data bases, and very high performance for normal speech. Robustness
techniques which were developed and applied include multi-style training, robust estimation of
parameter variances, perceptually-motivated stress-tolerant distance measures, use of time-differential
speech parameters, and discriminant analysis. These techniques and others produced more than an
order-of-magnitude reduction in isolated-word recognition error rate relative to a baseline HMM
system. An important feature of the Lincoln HMM system has been the use of continuous-observation
HMM techniques, which appear to enhance robustness, and avoid the need for a vector quantizer
at the input to the HMM system. Beginning in 1987, the robust HMM system has been extended
to continuous speech recognition for both speaker-dependent and speaker-independent tasks. The
robust HMM continuous speech recognizer was integrated in real-time with a stressing simulated
flight task, which was judged to be very realistic by a number of military pilots. Phrase recognition
accuracy on the limited-task-domain (28-word vocabulary) flight task is better than 99.9 percent.
Recently, the robust HMM system has been extended to large-vocabulary continuous speech
recognition, and has yielded excellent performance in both speaker-dependent and speaker-independent
recognition on the DARPA 1000-word vocabulary resource management data base. Current efforts
include further improvements to the HMM system, techniques for the integration of speech
recognition with natural language processing, and research on integration of neural network techniques
with HMM, aimed at further improvements in recognition performance.

The  organization  of  this  report  is  as  follows.  Section  2  describes  the  stress  robustness  problem, 
focussing  on  the  effects  of  an  aircraft  environment  (e.g.,  a  military  fighter  cockpit)  on  speech  and  on 
speech  recognition.  In  Section  3  the  technical  approach  to  robust  recognition  is  outlined,  focussing 
primarily  on  robustness  enhancements  to  an  HMM  isolated-word  recognition  system.  Section  4 
summarizes  the  stress/noise  data  bases  that  have  been  used,  and  Section  5  outlines  the  results 
obtained  for  isolated-word  recognition  both  for  stressed  and  normal  speech.  The  features  of  the 
robust continuous speech recognition system are outlined in Section 6, which actually describes a
class of evolving robust continuous speech recognition systems which have been applied to different
conditions and data bases. Section 7 describes the voice-controlled flight simulator, which has been
an effective demonstration experiment for real-time speech recognition in a stressful, time-critical
task. Section 8 summarizes current results in large-vocabulary continuous speech recognition. New
work in the integration of neural networks and HMM is described in Section 9, and Section 10
discusses conclusions and areas for future work.

This  report  is  intended  as  a  summary  overview  of  work  at  Lincoln  performed  over  several  years; 
the details are covered in the references [1]-[25] published at the IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP) and in other places.

Our  work  builds  on  and  is  influenced  by  the  efforts  of  numerous  past  researchers  and  present 
colleagues  in  speech  recognition,  and  particularly  in  HMM  technology.  A  selection  of  background 
HMM and speech recognition references is [26]-[31], which is necessarily only a partial list. The
bibliographies of those papers also provide numerous important references. In particular, we have
been fortunate to be a part of the multi-laboratory speech recognition program sponsored, since
1985, by the Defense Advanced Research Projects Agency, and have benefited from interactions
with our many colleagues in that program. References [32]-[43] provide a representative sampling
of  speech  recognition  work  at  other  laboratories  participating  in  the  DARPA  program  from  1985 
through  the  present  time. 
2. THE PROBLEM OF ROBUSTNESS TO STRESS AND NOISE IN THE
AIRCRAFT ENVIRONMENT


A  pilot  in  a  high-performance  military  aircraft  operates  in  a  heavy  workload  environment, 
where  his  hands  and  eyes  are  busy  and  speech  recognition  could  be  of  significant  advantage.  For 
example,  a  speech  recognizer  could  be  used  to  set  a  radio  frequency  or  to  choose  a  weapon,  without 
requiring  hand  motion  or  loss  of  visual  contact  with  other  aircraft  or  with  the  terrain.  The  potential 
improvement  in  pilot  effectiveness  could  be  extremely  significant  in  critical  situations. 

A  speech  recognizer  in  such  a  cockpit  environment  must,  however,  cope  with  severe  difficulties. 
The  pilot  may  be  exposed  to  high  ambient  acoustic  noise,  encumbered  by  equipment  such  as  an 
oxygen  mask  and  headphones,  kept  busy  with  a  stressful  workload  task,  and  subjected  to  physical 
and  psychological  stress.  These  factors  all  produce  significant  changes  in  the  speech  signal,  which 
make the recognition task more difficult. Both acoustic noise and headphones (which disturb
the auditory feedback path) cause the speaker to talk louder (Lombard effect [44]), with changed
spectral tilt and possibly with altered timing (these effects may be somewhat mitigated by the use
of  sidetone).  Generally,  it  has  been  found  that  changes  in  speech  style  due  to  noise  are  a  greater 
problem  than  additive  noise  in  the  speech,  when  a  facemask  with  a  close-talking  microphone  is 
used.  Excitement,  fear,  workload,  and  distraction  produce  idiosyncratic  speech  changes  including 
fast  speech,  careless  speech,  or  speech  block. 

In  general,  acoustic  effects  due  to  the  stressed  environment  can  include  changes  in  spectral  tilt, 
formant  positions,  speaking  rates,  timing,  and  phonology.  Although  much  study  of  these  effects 
has been carried out [45]-[47], the effects tend to vary idiosyncratically with the speaker and with
the  situation,  and  are  difficult  to  predict. 
3. TECHNICAL APPROACH TO ROBUST RECOGNITION:
ROBUSTNESS ENHANCEMENTS TO HIDDEN MARKOV MODEL SYSTEMS


In developing techniques for robust recognition, we chose at the beginning of this research
program [1] to build upon the Hidden Markov Model (HMM) approach [26]-[29]. Reasons for
choosing this approach included excellent previously-reported recognition performance results in a
variety  of  applications,  effective  extendability  from  isolated-word  recognition  to  continuous  speech 
recognition,  and  (very  importantly)  trainability  from  observed  data  by  automatic  methods.  To  our 
knowledge,  we  were  the  first  to  develop  HMM  techniques  directly  focussed  on  the  problem  of  speaker 
stress,  and  the  positive  results  obtained  provide  strong  evidence  that  HMM  provides  an  excellent 
framework for developing robust recognition techniques. Currently, HMM-based techniques are
widely applied to a large range of speech recognition problems at a variety of laboratories [26]-[29],
[32]-[43], [46]-[50], and generally have produced the most successful speech recognition systems
across  both  isolated-word  and  continuous  speech  recognition  tasks. 

To form a framework for work in robust recognition, we first developed a baseline HMM system
[1,4] using continuous-observation HMM techniques, where continuous parameters rather than
discrete, vector-quantized symbols (see [28]) were used as input to the recognizer. The use of
continuous observations, rather than discrete symbols from a vector quantizer as used in many other
current systems [36]-[39], was very important in the development of robustness enhancements, such
as the perceptually-motivated distance metric noted below. The training and recognition modules
of the baseline isolated-word HMM recognizer are illustrated in Figure 3-1.

With reference to the baseline HMM system and word models, note that the term "Hidden Markov
Model" refers to the modeling of speech (words, in this case) as a doubly-stochastic process, with
an underlying set of states (which can be thought of as states of the speech production mechanism),
and Markov transitions between the states. The states are never observed (hence, the term
"Hidden"), but for continuous-observation HMM the emissions of the observed parameters in each
state (mel cepstra in our models) are modeled according to probability distributions specific to that
state. For discrete-observation systems, the model includes the probabilities of each vector-quantized
symbol being emitted. Once the researcher has specified the form of the model, there is an efficient
iterative training algorithm (the Baum-Welch or forward-backward algorithm; see, e.g., [28]) which
automatically trains the model parameters from training speech (in this case, a number of samples
of each word in the vocabulary). For recognition, there is an efficient algorithm (the Viterbi [25]-[29]
technique) which chooses the most likely word given a sequence of observations.

The baseline system, described in detail in [1,4], uses mel-frequency cepstra [31] computed every
10 ms as its fundamental observation parameters. The system is termed a "diagonal covariance"
system, in that the joint probability density function of the cepstral parameters is assumed to be a
multivariate Gaussian distribution with diagonal covariance matrix. The baseline recognizer uses
the standard Baum-Welch iteration for HMM training, and applies a Viterbi recognizer to find the
single highest scoring path through the HMM network. The vocabulary word corresponding to this
highest score is selected as the recognized word.
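
The recognition step described above can be illustrated with a minimal Python sketch. It assumes NumPy, whole-word models with one diagonal-covariance Gaussian per state, and a simple left-to-right topology; the function names, the `p_stay` self-transition probability, and the requirement that the path start in the first state and end in the last are illustrative assumptions, not details taken from the Lincoln system. Baum-Welch training is not shown.

```python
import numpy as np

def log_gauss_diag(obs, means, variances):
    """Log density of each frame under each state's diagonal-covariance Gaussian.

    obs: (T, D) frames; means, variances: (N, D) per-state parameters.
    Returns a (T, N) array of log likelihoods.
    """
    diff = obs[:, None, :] - means[None, :, :]
    return -0.5 * np.sum(
        np.log(2.0 * np.pi * variances)[None, :, :] + diff ** 2 / variances[None, :, :],
        axis=2,
    )

def left_to_right_log_trans(n_states, p_stay=0.6):
    """Hypothetical topology: each state either repeats or advances by one."""
    trans = np.zeros((n_states, n_states))
    for i in range(n_states):
        if i + 1 < n_states:
            trans[i, i] = p_stay
            trans[i, i + 1] = 1.0 - p_stay
        else:
            trans[i, i] = 1.0
    with np.errstate(divide="ignore"):  # log(0) = -inf marks forbidden transitions
        return np.log(trans)

def viterbi_score(obs, means, variances, log_trans):
    """Log score of the best state path that starts in state 0 and ends in the last state."""
    log_b = log_gauss_diag(obs, means, variances)
    score = np.full(means.shape[0], -np.inf)
    score[0] = log_b[0, 0]
    for t in range(1, obs.shape[0]):
        # score[i] + log_trans[i, j], maximized over the predecessor state i
        score = np.max(score[:, None] + log_trans, axis=0) + log_b[t]
    return score[-1]

def recognize(obs, word_models):
    """Return the vocabulary word whose model gives the highest Viterbi score."""
    return max(word_models, key=lambda w: viterbi_score(obs, *word_models[w]))
```

Here `word_models` is assumed to map each vocabulary word to a `(means, variances, log_trans)` tuple, so that `recognize(frames, word_models)` selects the word with the highest-scoring path.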
Figure 3-1. Baseline HMM isolated-word recognition system.
Many of the robustness enhancements which have been developed and tested ([1]-[17]) are
indicated in Figure 3-2. The specific robustness enhancements, which are numbered and italicized
in Figure 3-2 for reference, include enhancements to selection of training utterances [1,7]; to the
acoustic parameters [1]; to the HMM model form and constraints used in training and recognition
[1]-[7]; to background noise modeling [4]; and to discrimination of acoustically-similar words [9,10].
In summary, the key robustness enhancements include:

1. Multi-style training, in which the HMM system is trained on speech spoken in
a variety of speech styles to model the variabilities due to stress;

2. Time-differential mel-cepstral parameters, used in addition to the basic
mel-cepstral parameters, to better account for the rate of change in the speech
signal;

3. Robust parameter estimation techniques, including grand variance (averaging
or tying variances over all HMM nodes) to compensate for limitations in the
amount of training data;

4. A stress-tolerant distance measure using a perceptually-motivated diagonal
covariance for HMM nodes, designed to reduce the sensitivity of the recognizer
to stress-induced speech variability such as spectral tilt (this distance measure
represents an alternative to the robust variance estimation techniques noted in
Item 3);

5. Use of improved durational models to better model the time spent in each state
of the HMM network (these were found to be very computationally expensive,
and not of sufficient benefit to be included in our final systems);

6. Adaptive modeling of the background noise, including facemask breath noise
models where appropriate;

7. A cepstral domain stress compensator as a preprocessor for the HMM
recognizer;

8. A second-stage, feature-based discriminant analysis system designed to
distinguish acoustically-similar words.

Items  (1-6)  above  fit  directly  into  the  HMM  framework,  while  (7-8)  are  integrated  into  the 
recognizer,  but  outside  the  HMM  framework.  A  variety  of  other  techniques  have  been  developed 
and  tested  during  the  course  of  this  work,  including  discriminant  clustering  to  reduce  the  number 
of  parameters  and  subword  models,  dynamic  adaptation  to  talker  and  environment,  and  a  variety 
of  techniques  for  automatic  adjustment  of  model  complexity  to  the  available  amount  of  training 
data. 
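
As an illustration of Item 2 in the list above, the following Python sketch appends a simple time difference of the mel-cepstra to each frame. It assumes NumPy; the function name, the symmetric difference over `lag` frames, and the edge padding are illustrative assumptions, since this overview does not specify how the differential parameters were computed.

```python
import numpy as np

def add_delta(cepstra, lag=2):
    """Append a simple time difference of the cepstra to each frame.

    cepstra: (T, D) array, one row per analysis frame.
    lag: hypothetical frame offset; delta[t] = c[t + lag] - c[t - lag],
         with edge frames repeated so the output keeps T rows.
    Returns a (T, 2 * D) array: the original parameters followed by the deltas.
    """
    padded = np.pad(cepstra, ((lag, lag), (0, 0)), mode="edge")
    delta = padded[2 * lag:] - padded[:-2 * lag]
    return np.hstack([cepstra, delta])
```

The augmented frames can then be modeled by the same diagonal-covariance Gaussians as the basic parameters, at the cost of doubling the observation dimension.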

The  above  serves  as  an  outline  of  the  approach  to  robustness.  More  discussion  on  the  robustness 
enhancements,  and  on  their  effect  on  recognition  performance,  will  be  presented  in  the  sections  to 
follow. 
Figure 3-2. Robust HMM isolated-word recognition system with robustness
enhancements indicated in italics.


4. DATA BASES OF SPEECH PRODUCED UNDER STRESS AND IN NOISE


Collection  of  a  large,  systematic  data  base  of  speech  produced  under  real  stress  conditions  is 
a very difficult task. Our approach has been to rely heavily on speech produced with "simulated
stress," where the talker is asked to vary the style of speaking to exhibit the range of acoustic
variation  typical  of  stressed  speech.  In  addition  to  style  variation,  laboratory  conditions  of  workload 
stress  have  been  utilized.  The  effects  of  noise  exposure  in  the  ears  (Lombard  effect  [44,45,47])  have 
been  observed  to  produce  similar  changes  to  speech  as  those  produced  under  stress,  and  our  data 
bases  have  generally  included  the  Lombard  condition.  Two  primary  data  bases  have  been  used  for 
the  development  and  test  of  robust  isolated-word  recognition  systems. 

1. The "TI-stress" data base, a simulated-stress data base provided to us by
Texas Instruments [1,42]. This data base uses a 105-word "pilot" vocabulary.
It includes 8 talkers (5 male, 3 female) speaking in five talker styles (normal,
fast, loud, soft, and shout) and the Lombard condition. There are 5 training
utterances (normal speech) per word per talker, and 2 test utterances per word
per condition per talker. The data base includes a total of 14,280 utterances.

2. The "Lincoln-stress" data base [7]-[9], collected at Lincoln. This data base
uses a 35-word vocabulary (a subset of the 105-word TI vocabulary) which was
selected to include a number of word subsets which are difficult for recognition
systems, such as {go, no, oh} and {six, fix}. It includes 9 talkers (6 male,
3 female) from 3 different dialect groups, with speech produced for 11 conditions:
8 talking styles (normal, slow, fast, soft, loud, clear enunciation, angry,
and question pitch), the Lombard condition, and while performing a
motor-workload task at two calibrated levels of difficulty. There are 12 training
utterances (normal speech) per word per talker, and 2 test utterances per word
per condition per talker. The data base includes a total of 10,710 utterances.
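
The utterance totals quoted for the two data bases follow directly from the per-talker design. As a quick arithmetic check in Python (the function name is illustrative, and the calculation assumes every talker recorded every word under every listed condition):

```python
def utterance_count(words, talkers, train_per_word, test_per_word, conditions):
    """Total utterances when every talker says every word train_per_word times
    for training, plus test_per_word times under each test condition."""
    return words * talkers * (train_per_word + test_per_word * conditions)

# TI-stress: 5 talker styles plus the Lombard condition = 6 conditions
ti_stress = utterance_count(105, 8, 5, 2, 6)
# Lincoln-stress: 8 styles + Lombard + 2 workload levels = 11 conditions
lincoln_stress = utterance_count(35, 9, 12, 2, 11)

print(ti_stress, lincoln_stress)  # prints: 14280 10710
```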

Additional isolated-word recognition experiments have been conducted on a standard,
normally-spoken 20-word vocabulary data base [30], collected by Texas Instruments, on which many
systems have been tested. This data base is sometimes referred to as "TI-20."

Robust isolated-word recognition experiments and results on all the data bases described above
will be described in Section 5. Some of our robust continuous-speech recognition development and
experiments (see Section 7) have been conducted using the "DARPA-robust" continuous speech
data base [12,42]. This data base uses a pilot-oriented finite-state grammar and a 207-word
vocabulary. Sentences typically range from 4 to 8 words, and the grammar is very constrained. In our
work, we used a subset of the data base including 4 training conditions (normal, fast, loud, and 90
dB pink noise), 2 simulated F-16 conditions, and the shout condition. All conditions were recorded
using fighter pilot facemasks and headphones. The portion of the data base we used included 5
male speakers, with a total of about 540 sentences per training condition and 220 sentences per
test condition.
Recently, most of our continuous speech recognition work has used the normally-spoken,
991-word vocabulary, DARPA resource-management (RM) data base, which is well-documented [41],
and has been used by many speech research groups. More description of this data base will be
provided in Section 9, where large-vocabulary continuous speech recognition work is described.
5. ROBUST ISOLATED WORD RECOGNITION EXPERIMENTS AND RESULTS


This section summarizes experiments and results in isolated-word recognition of both stressed
and normal speech. First, overall results on the TI-stress data base are presented, demonstrating
the significant improvements for stressed speech due to the robustness enhancements. The
improvements for normal speech are also noted. Then, Lincoln work on a variety of specific enhancement
techniques is described, including experiments on both the TI-stress and Lincoln-stress data bases.

Results obtained with the robust HMM isolated-word recognition system are illustrated in
Figure 5-1 for the TI-stress data base (105-word vocabulary, 8 talkers, 5 training and 2 test
utterances per talker per condition). All experiments were speaker-dependent, in that the system was
trained for each speaker using training examples of the vocabulary words spoken by that speaker.
Three systems are compared, all of which are diagonal-covariance-matrix, Gaussian probability
density systems with mel-cepstral observations and whole-word models. The "textbook" baseline
system uses straightforward training of nodal variances and means. The robust "normal training"
system was also trained only on normally-spoken speech, but included enhancements such as
perceptually-motivated distance measure, time-differential parameters, and adaptive background
estimation. The robust "multi-style" system included samples of the different speech styles in
training (of course the actual test utterances were always separate from the training utterances).
Five conditions were tested. The baseline HMM worked reasonably well for normal speech, but
degraded for the simulated-stress and Lombard conditions. The improvements with the
enhancements are apparent. The percentage substitution rates for normal speech, and for the average
over five conditions, are given alongside the figure. For the average over the conditions, more than
an order-of-magnitude improvement was achieved. In addition, the robustness enhancements
significantly improved performance for normal speech. Best results were obtained using multi-style
training. However, the results using the robust system with normal training were also excellent.

The robust isolated-word recognition system, which was developed using the "TI-stress" data
base, yielded similar performance results on the "Lincoln-stress" data base, for simulated-stress,
Lombard, and workload stress conditions. The system was also tested on the standard TI 20-word
vocabulary, normal-speech data base [30] and yielded the best results known to date on that data
base: 99.94 percent correct, on the first test on that data base. The system also performed well
in a number of informal live-input tests, including over long-distance telephone lines. The basic
robust HMM isolated-word recognition system, and results and experiments on the TI-stress data
base and on the TI 20-word normal speech data base, are described in [4].

The following paragraphs give a summary of Lincoln work on a variety of specific robustness
techniques, including experiments on both the TI-stress and Lincoln-stress data bases.

Figure 5-1. Performance of robust HMM-word recognition system on the 105-word
vocabulary TI-stress data base. All experiments are speaker-dependent. The substantial
improvements over the baseline system for the simulated-stress and noise exposure
(Lombard) conditions are apparent. The robustness enhancements also improved the
performance for normal speech.

An issue which was investigated early in our work [1] was the relative importance of the
following two major effects of noise on recognition in the fighter cockpit: (1) the noise causes
the pilot to speak louder and more distinctly (Lombard effect), and (2) the noise leaks into the
microphone and degrades the input signal-to-noise (S/N) ratio. Experiments were performed using
recordings made at the Wright-Patterson Air Force Medical Research Laboratory. Words in a
25-word vocabulary were produced by one talker wearing a facemask and helmet in an ambient
condition and with simulated fighter aircraft (F-16) background noise levels of 95, 105, and 115
dB sound pressure level (SPL). The S/N ratio at the recognizer input (after the noise-cancelling
microphone) was determined for the ambient training condition, and for training samples using
speech collected under the multiple noise conditions. Recognition results are shown in Figure 5-2.
Recognition results for normal training indicate severe performance degradation in noise, although
the S/N ratio remains high (23 dB) even for the highest noise level. However, good performance is
achieved with multi-condition training using training data obtained under multiple noise conditions.
These data, together with careful listening to the recordings, strongly suggest that the degraded
performance was due to the Lombard effect, and not due to additive noise. These results, which are
consistent with those presented in [42], indicate that the Lombard effect is more important than
additive noise, at least in an environment where a noise-cancelling microphone is used. The results
in Figure 5-2 also demonstrate that training under multiple conditions is an effective technique to
compensate for the Lombard effect.
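
The S/N figure quoted above is a power ratio expressed in decibels. A minimal Python sketch of how such a ratio could be estimated from sampled waveforms follows; it assumes NumPy, and the function name and the use of separate speech and noise-only segments are illustrative assumptions, since this overview does not describe the measurement procedure actually used.

```python
import numpy as np

def snr_db(speech_segment, noise_segment):
    """Ratio of mean power in a speech segment to mean power in a
    noise-only segment, in decibels (10 * log10 of the power ratio)."""
    p_speech = np.mean(np.square(speech_segment))
    p_noise = np.mean(np.square(noise_segment))
    return 10.0 * np.log10(p_speech / p_noise)
```

On this scale, a 23 dB ratio corresponds to speech power roughly 200 times the noise power, which is consistent with the observation that additive noise was not the dominant problem.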
Figure 5-2. HMM system performance in a simulated F-16 noise environment with
normal training and with training under multiple noise conditions.


The focus of [5,6] is a particular robustness enhancement technique wherein the basic recognition
parameters (mel-frequency cepstra) are modified adaptively to compensate for variations
due  to  stress.  This  adaptation  is  shown  to  compensate  for  spectral  tilt  and  to  produce  significant 
performance  improvements  for  isolated-word  recognition  systems  trained  with  normal  speech.  The 
studies  conducted  in  developing  these  compensation  techniques  also  yielded  important  data  [5,6] 
on  the  specific  effects  of  various  speaking  styles  on  the  cepstral  parameters. 

Multi-style  training,  and  experiments  and  results  on  the  Lincoln-stress  data  base,  are  the  focus 
of  [7,8].  The  effectiveness  of  training  on  multiple  talker  styles  in  improving  recognition  performance 
for  stress  and  noise  conditions  (workload,  Lombard)  not  included  in  the  training  data  is  reported 
and  discussed.  Overall  recognition  accuracy  of  99  percent  on  the  difficult  Lincoln-stress  data  is 
reported,  achieved  via  a  combination  of  multi-style  training  and  other  robustness  enhancements. 
A second-stage discriminant-analysis system, developed to serve as a post-processor to the
HMM recognizer in order to resolve confusion between acoustically-similar words, is described in
[9,10]. This discriminant system is trained by passing samples of every word in the vocabulary through
the  HMM  models  of  every  word  in  the  vocabulary,  to  explicitly  model  acoustic  differences  between 
words.  A  statistically-based  sifting  technique  is  described  which  selects  only  those  parameters 
which  are  likely  to  be  effective  in  discrimination.  Performance  improvements  relative  to  the  robust 
single-stage  HMM  are  reported  for  the  Lincoln-stress  data  base,  contributing,  for  example,  to  the 
overall  99  percent  recognition  accuracy  on  that  data  base. 

An  illustrative  summary  [7]  of  the  effects  of  various  robustness  techniques  on  performance  with 
the  Lincoln-stress  data  base  is  shown  in  Figure  5-3.  The  basic  HMM  system,  augmented  with  a 
variance-limiting  technique  to  prevent  underestimates  of  parameter  variances,  achieved  17.5  percent 
error  rate  (averaged  over  9  talkers  and  10  conditions).  Multi-style  training  reduced  the  error  rate 
to  6.9  percent,  and  the  use  of  differential  cepstral  parameters  (in  addition  to  the  basic  cepstral 
parameters)  reduced  error  rate  further  to  3.2  percent.  Grand  variance  techniques,  which  reduce 
the  effect  of  limited  training  data  by  estimating  cepstral  parameter  variances  as  an  average  over  all 
words  and  phonemes  (the  means  are  separately  estimated),  lowered  the  error  rate  to  1.6  percent. 
Finally,  the  second-stage  discriminant  analysis  corrected  enough  of  the  remaining  confusion  to 
reduce  the  error  rate  on  this  data  base  to  1  percent. 
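
The cumulative effect of these steps can be checked with simple arithmetic. The sketch below uses only the error rates quoted above; the variable and function names are illustrative and do not come from the report.

```python
# Error rates (percent) quoted in the text for the Lincoln-stress data base.
stages = [
    ("baseline HMM + variance limiting", 17.5),
    ("+ multi-style training", 6.9),
    ("+ differential cepstral parameters", 3.2),
    ("+ grand variance", 1.6),
    ("+ second-stage discriminant analysis", 1.0),
]

def relative_reductions(stages):
    """Return (name, percent reduction relative to the previous stage)."""
    out = []
    for (_, prev), (name, cur) in zip(stages, stages[1:]):
        out.append((name, 100.0 * (prev - cur) / prev))
    return out

def overall_factor(stages):
    """Ratio of the first error rate to the last."""
    return stages[0][1] / stages[-1][1]

for name, pct in relative_reductions(stages):
    print(f"{name}: {pct:.1f}% relative reduction")
print(f"overall: {overall_factor(stages):.1f}x lower error rate")
```

Each technique removes a large fraction of the errors left by the previous one, and together they lower the error rate by more than an order of magnitude.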

A  technique  for  dynamic  adaptation  of  HMM  isolated-word  model  parameters  to  new  speakers 
and  to  stress-induced  speech  variations  is  described  in  [13].  Tests  were  performed  on  the  Lincoln- 
stress  data  base.  Results  of  these  speaker  adaptation  experiments  are  illustrated  in  Figure  5-4.  It 
was  found  to  be  crucial  to  allow  user  feedback  to  assist  the  adaptation  process  by  correcting  errors 
produced by the system. This is illustrated in Figure 5-4 by the difference between the case of
adaptation  on  all  tokens  (the  user  must  supply  the  correct  answer  when  an  error  is  made)  and  the 
case  of  adaptation  only  on  correct  recognition.  With  corrective  feedback  from  the  user  provided, 
speaker-adaptation  experiments  produced  error  rates  equivalent  to  speaker-trained  systems  after 
presentation of only a single new token per vocabulary word. Stress-condition adaptation experiments
produced results comparable to multistyle-trained systems after the presentation of several
new  tokens  per  vocabulary  word.  Similar  adaptation  techniques,  focusing  primarily  on  adaptation 
to  noise  conditions,  are  presented  in  [51]. 
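
The difference between the two adaptation modes can be pictured with a toy example. Everything in the sketch below is an assumption made for illustration: the one-dimensional word models, the nearest-mean recognizer, and the interpolation update with weight `alpha`. It is not the adaptation algorithm of [13].

```python
def recognize(means, x):
    """Return the word whose (1-D) model mean is closest to observation x."""
    return min(means, key=lambda w: abs(means[w] - x))

def adapt(means, tokens, supervised, alpha=0.5):
    """Adapt model means on a stream of (true_word, observation) tokens.

    supervised=True : adapt on all tokens; the user supplies the correct
                      word whenever the recognizer makes an error.
    supervised=False: adapt only when the recognizer was correct.
    """
    means = dict(means)  # do not modify the caller's models
    for true_word, x in tokens:
        hyp = recognize(means, x)
        if hyp != true_word and not supervised:
            continue  # error with no feedback: skip this token
        # Move the correct word's mean toward the new observation.
        means[true_word] = (1 - alpha) * means[true_word] + alpha * x
    return means

# Hypothetical models from the training speaker and tokens from a new speaker.
initial = {"yes": 0.0, "no": 10.0}
tokens = [("yes", 6.0), ("yes", 6.0)]
print(adapt(initial, tokens, supervised=True))
print(adapt(initial, tokens, supervised=False))
```

In this toy case the new speaker's "yes" is initially misrecognized, so without corrective feedback the model never receives the tokens it most needs and cannot recover.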

Finally, a training procedure called discriminant clustering for automatically generating subword
HMM models for an IWR system is presented in [14]. (A similar technique for template-based
recognition is described in [46].) HMM node sequences from whole-word models were merged using
statistical  clustering  techniques.  This  procedure  reduced  computation  during  recognition  (on  the 
Lincoln-stress  data  base)  by  roughly  one-third  without  significant  increase  in  error  rate.  Additional 
clustering  work,  focusing  specifically  on  triphones,  is  described  in  [12]. 

Figure 5-3. Effects of robustness techniques on HMM recognition performance on the
Lincoln-stress 35-word vocabulary data base.

Figure 5-4. Experiments on adaptation of HMM recognizer to a new speaker (normal
speech). All points are averaged over 9 speaker pairs where initial training models are
obtained from one speaker then tested on another.

6. ROBUST CONTINUOUS SPEECH RECOGNITION SYSTEM


The performance improvements achieved on simulated-stress isolated-word data bases led us to
extend  the  robust  HMM  work  to  continuous  speech  recognition.  Although  isolated-word  recognition 
could  be  useful  in  environments  such  as  the  cockpit,  users  generally  would  prefer  continuous  speech 
to  isolated  words,  as  long  as  high  recognition  accuracy  is  maintained.  A  fairly  restrictive  grammar 
of  command  phrases  would  be  useful  to  pilots  and  could  constrain  the  recognition  task  sufficiently 
to  provide  good  performance  in  the  cockpit  environment. 

A  key  step  in  moving  to  a  continuous  speech  recognition  system  was  to  change  from  whole- 
word  to  subword  (phone)  models.  In  particular,  this  removed  the  need  to  train  on  all  words  in  the 
vocabulary.  In  deriving  subword  models,  we  first  derived  from  a  dictionary  a  representation  of  each 
word as a sequence of phones. Phone context was taken into account by allowing separate context-dependent
phone models, referred to as triphones [12,36,38], for each left and right context in
which  a  phone  occurred.  Phones  (or  triphones)  were  modeled  as  linear  sequences  of  HMM  nodes, 
and  words  were  modeled  as  linear  sequences  of  triphones. 
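
The expansion from words to phones to triphones can be sketched as follows. The dictionary entries, the phone symbols, and the `left-phone+right` label notation are assumptions chosen for illustration; the report does not specify them.

```python
# Hypothetical pronunciation dictionary; phone symbols are illustrative.
DICTIONARY = {
    "three": ["th", "r", "iy"],
    "words": ["w", "er", "d", "z"],
}

def word_to_triphones(word, boundary="#"):
    """Expand a word into left-phone+right triphone labels.

    Contexts at the word edges are marked with a boundary symbol, so
    these are word-context-free triphones.
    """
    phones = DICTIONARY[word]
    padded = [boundary] + phones + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]

print(word_to_triphones("three"))
```

A word model is then the concatenation of the HMM node sequences for its triphone labels, so a new word needs only a dictionary entry and not its own training tokens.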

For  the  continuous  speech  recognition  system  it  was  also  necessary  to  incorporate  word  order 
constraints  (syntax  and  semantics).  A  finite-state  grammar  was  introduced,  which  could  be  adapted 
to  a  variety  of  tasks  of  different  difficulty.  The  primary  measure  of  difficulty  for  a  grammar  in  this 
work  is  perplexity,  defined  as  two  raised  to  the  power  of  the  entropy  of  the  language,  or  the  geometric 
average  of  the  branching  factor  (number  of  words  allowed  to  follow  a  given  word  in  the  grammar). 
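
The two descriptions of perplexity agree when every allowed successor word is taken to be equally likely, as the following sketch shows. The branching factors are made up for illustration, and the function names are not from the report.

```python
import math

def perplexity_from_entropy(entropy_bits):
    """Perplexity is two raised to the per-word entropy (in bits)."""
    return 2.0 ** entropy_bits

def geometric_mean_branching(branching_factors):
    """Geometric mean of the number of word choices at each decision point."""
    logs = [math.log2(b) for b in branching_factors]
    return 2.0 ** (sum(logs) / len(logs))

# Branching factors met at successive words of a hypothetical sentence,
# assuming every allowed successor is equally likely.
branching = [4, 16, 8, 8]
entropy = sum(math.log2(b) for b in branching) / len(branching)
print(perplexity_from_entropy(entropy))      # same value as the line below
print(geometric_mean_branching(branching))
```

With equally likely successors, a choice among b words carries log2(b) bits, so averaging the bits and exponentiating is the same operation as taking the geometric mean of the branching factors.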

Robustness  features  which  were  extended  from  the  isolated-word  to  the  continuous  speech 
system included the basic approach of continuous observation HMM, mel-cepstral observations
with  temporal  differences,  perceptually-motivated  distance  measures  or  tied  variances,  and  adaptive 
background  models. 

Some  of  the  additional  features  that  have  been  developed  and  tested  in  various  versions  of  the 
robust  continuous  speech  recognition  system  include: 

1.  Training  of  triphone  models  using  an  unsupervised  monophone  bootstrap  so 
that  only  an  orthographic  transcription  of  the  training  data  (no  hand-marking) 
is needed;

2.  Gaussian  mixtures  of  variable  order  to  model  the  probability  density  functions 
of  the  observations  as  weighted  sums  of  Gaussian  densities,  rather  than  as  a 
single  Gaussian  for  each  probability  density  function; 

3.  Word-context-dependent  triphone  models  to  account  for  interword  context; 

4.  Extrapolation  to  model  missing  triphones. 
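
Item 2 above, a probability density modeled as a weighted sum of Gaussians, can be sketched as follows. Diagonal covariances and the log-sum-exp evaluation are assumptions made here for simplicity; the report does not state these implementation details.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_mixture_density(x, weights, means, variances):
    """Log of a weighted sum of Gaussian densities (log-sum-exp for stability)."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

# A two-component mixture over a 2-D observation (made-up parameters).
obs = [0.5, -1.0]
print(log_mixture_density(
    obs,
    weights=[0.6, 0.4],
    means=[[0.0, -1.0], [2.0, 1.0]],
    variances=[[1.0, 1.0], [0.5, 2.0]],
))
```

A mixture of order one reduces to the single-Gaussian case, so the mixture order can differ from one density to another without changing the rest of the recognizer.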

More  detail  on  the  development,  evolution,  and  testing  of  the  robust  CSR  system  is  given  in 
[12,15,16,17]. A sampling of results is presented in the next three sections.

7. CONTINUOUS SPEECH RECOGNITION EXPERIMENTS ON THE
DARPA-ROBUST DATA BASE


The continuous speech recognition system was first developed and tested on the “DARPA-robust”
data base [12] (see Section 4). Performance on a 207-word, perplexity-14 task was 2.5
percent  word  error  rate  (best  speaker)  and  5  percent  (4-speaker  average)  for  the  normal  condition 
of  the  data  base.  Performance  was  poorer  under  the  various  conditions  of  simulated  stress  and  noise, 
which  indicated  that  this  was  a  rather  difficult  data  base.  However,  there  were  a  number  of  problems 
in  gathering  this  data  base,  including  poor  speaker  motivation  and  equipment  problems,  which  may 
have  unduly  increased  the  difficulty  of  the  recognition  task.  These  problems  unfortunately  may 
have  masked  the  effects  we  were  trying  to  observe,  namely,  those  due  to  stress  and  noise. 

A  small  data  base  using  the  same  vocabulary  and  grammar  was  gathered  for  a  motivated 
speaker  under  office  conditions,  and  a  word  error  rate  of  0.9  percent  was  achieved.  This  indicated 
that  the  vocabulary  and  grammar  were  not  particularly  difficult,  and  that  the  problems  were  in 
the  speech  recordings  themselves.  Our  observations  and  results  on  the  DARPA-robust  data  base 
were  similar  to  those  of  our  colleagues  at  Texas  Instruments  [43].  Based  on  the  general  problems 
observed  with  the  DARPA-robust  data  base,  we  elected  to  further  develop  and  test  the  continuous 
speech  recognition  system  on  other  data  bases,  primarily  the  DARPA  Resource  Management  data 
base. 

8. VOICE-CONTROLLED FLIGHT SIMULATOR: A DEMONSTRATION
OF REAL-TIME CONTINUOUS SPEECH RECOGNITION UNDER
STRESS


In order to demonstrate and test the capabilities of our robust recognition system in a real-time
task  which  simulates  the  type  of  stress  typical  of  the  cockpit  environment,  we  have  developed  a 
voice-controlled  flight  simulator  as  illustrated  in  Figure  8-1.  The  task  is  takeoff,  flight,  and  landing 
of  a  voice-controlled,  simulated  F-15  aircraft.  It  is  not  intended  to  model  an  actual  flight  task,  as 
we  do  not  recommend  that  critical  functions,  such  as  landing,  be  handled  by  voice.  However,  the 
real-time  interactive  environment  was  intended  to  be  typical  of  the  flight  environment,  where  pilots 
might  control  noncritical  tasks  by  voice  (see  Section  2).  The  F-15  flight  simulator  was  developed 
at  Lincoln  (by  D.  Paul)  and  was  judged  to  be  realistic  by  a  number  of  military  pilots.  Figure  8-1 
shows  the  speech  recognizer,  a  depiction  of  a  typical  flight  pattern,  and  the  flight  simulator  display 
as viewed by the operator. Both the flight simulator and the HMM recognizer operate in real
time on a SUN-4 computer. The signal processing front-end is implemented in real time on a
Lincoln-built  signal  processing  computer. 

Although  the  demonstrations  are  conducted  in  a  relatively  benign  acoustic  environment,  the 
speaker  stress  is  real,  due  largely  to  the  time-critical  nature  of  the  task.  The  system  uses  the 
HMM  continuous  speech  recognizer  with  a  28-word  vocabulary,  perplexity-7  grammar  to  control 
the  aircraft.  In  one  series  of  demonstrations  requiring  over  1000  command  sentences,  only  one  error 
occurred.  Our  general  experience  is  that  the  correct  phrase  is  recognized  more  than  99.9  percent  of 
the  time.  The  results  for  this  system  demonstrate  that  speaker  stress  for  a  motivated  speaker  can 
be  tolerated  for  a  simple  recognition  task,  even  when  very  high  recognition  accuracy  is  required. 

Figure 8-1. Voice-controlled flight simulator, which demonstrates real-time continuous
speech recognition under task-induced stress. Illustrated are the speech recognizer,
a typical flight pattern, and the flight simulator display.


9. LARGE-VOCABULARY CONTINUOUS SPEECH RECOGNITION


Recently, the Lincoln stress-resistant HMM CSR has been extended to large-vocabulary, normally-spoken,
continuous speech recognition for both speaker-dependent (SD) and speaker-independent
(SI) tasks. Development and test for this effort have been carried out using the DARPA Resource
Management data base [41], which is being widely used by a number of other organizations. This
data base consists of sentences typical of those which would be spoken in a Naval resource management
task, allowing data retrieval and management of ships and other Naval resources. The
speech is normally-spoken, the vocabulary is 991 words, and the average sentence length is eight
words. Tests were run both with an “official” word-pair grammar (list of allowable word pairs, no
assigned probabilities) with a recognition perplexity of 60 and with “no grammar” (all word pairs
allowed),  corresponding  to  a  perplexity  of  991.  The  speaker-dependent  portion  of  the  data  base 
has  12  speakers,  with  600  training  sentences  per  speaker  and  100  development  test  sentences  per 
speaker.  For  speaker-independent  work,  we  trained  on  2880  sentences  from  72  speakers  (SI-72) 
or  3990  sentences  from  109  speakers  (SI-109),  and  used  the  same  development  test  data  as  for 
speaker-dependent work. In the DARPA program, a series of official evaluation tests (October 1987,
June 1988, February 1989, October 1989) has been run on new data not used in either training
or  development. 

The general features of the Lincoln large-vocabulary CSR system are as summarized in Section
6.  The  details  of  the  system  have  continued  to  be  refined,  with  the  goal  of  improved  performance,  in 
response  to  experimental  results  obtained  during  development  tests  and  in  official  evaluation  tests. 
Throughout  this  work,  the  Lincoln  continuous-observation  HMM  continuous  speech  recognizer  has 
achieved  performance  on  the  DARPA  resource  management  task  similar  to  that  of  the  other  leading 
DARPA  research  groups,  all  of  which  use  discrete  observation  HMM.  In  addition,  the  Lincoln  system 
has  been  the  only  one  to  be  tested  in  both  speaker-dependent  and  speaker-independent  tests  for 
all  the  official  evaluations. 

The  DARPA  resource  management  tests  have  all  been  for  normal  speech,  but  the  earlier 
Lincoln  work  provides  evidence  that  continuous  observation  HMM  has  robustness  advantages  for 
stressed speech. Current work on other problems requiring robustness (e.g., spotting of key words
in  continuous  telephone  speech)  has  also  focused  on  continuous  observation  HMM. 

To illustrate some highlights of the Lincoln work in large-vocabulary continuous speech recognition,
it is useful to summarize improvements made in the system between the June 1988 and
February 1989 official tests [15,16]. These include word-context-dependent triphone models, variable
mixtures, and tied mixtures [16,17]. (Note: tied mixtures were developed, but not actually
used in the official February 1989 tests.) Work on tied mixtures at other laboratories is described
in [52,53].

Word-context-dependent  models  extend  the  triphone  context  to  include  the  phone  on  the  other 
side  of  a  word  boundary.  For  example,  the  “ee”  in  “three  words”  would  have  a  triphone  model 
distinct from the “ee” phone in “three phones.” Word-context-free models would not distinguish
between those two cases. Training and recognition strategies were developed for word-context-dependent
models. For training, the training data is used twice per iteration of the Baum-Welch
training algorithm: once to train word-context-free triphone models and once to train observed
word-context-dependent triphones. For recognition, a significant increase in complexity is needed
to account for the various word-boundary topologies. The speaker-dependent system using word-context-dependent
triphones achieved significant improvements over the word-context-free system:
3.0  percent  versus  5.2  percent  error  rate  on  development  data.  However,  the  speaker-independent 
word-context-dependent  results  were  worse  than  for  word-context-free  triphones,  indicating  that  the 
word-context-dependent  system  may  be  too  detailed  a  model  for  the  available  speaker-independent 
training  data. 
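
The “three words” versus “three phones” example can be made concrete with a small labeling sketch. The pronunciations, phone symbols, and label notation are assumptions for illustration only.

```python
# Hypothetical pronunciations; phone symbols are illustrative only.
DICTIONARY = {
    "three": ["th", "r", "iy"],
    "words": ["w", "er", "d", "z"],
    "phones": ["f", "ow", "n", "z"],
}

def phrase_to_triphones(words, cross_word, boundary="#"):
    """Label every phone in a phrase with its left and right context.

    cross_word=False: contexts stop at word boundaries (word-context-free).
    cross_word=True : contexts extend into the neighbouring word
                      (word-context-dependent).
    """
    labels = []
    for w, word in enumerate(words):
        phones = DICTIONARY[word]
        # Context phones just outside this word, if cross-word context is on.
        left = DICTIONARY[words[w - 1]][-1] if cross_word and w > 0 else boundary
        right = (DICTIONARY[words[w + 1]][0]
                 if cross_word and w + 1 < len(words) else boundary)
        padded = [left] + phones + [right]
        labels.append([f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
                       for i in range(1, len(padded) - 1)])
    return labels

for phrase in (["three", "words"], ["three", "phones"]):
    for cross_word in (False, True):
        print(phrase, cross_word, phrase_to_triphones(phrase, cross_word)[0][-1])
```

With cross-word context the final phone of “three” receives a different label in each phrase, which multiplies the number of distinct models and therefore the amount of training data needed per model.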

Variable-order  Gaussian  mixtures  were  introduced  to  attempt  to  match  the  complexity  of  the 
model (the number of Gaussians per mixture) to the available amount of training data (the number
of  occurrences  of  a  particular  triphone  in  the  training  data).  Small  improvements  were  obtained 
for  the  speaker-independent  task,  and  the  results  indicated  that  the  basic  idea  was  successful  but 
that  the  function  chosen  to  select  the  mixture  order  was  not  optimum. 
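
The report does not give the function used to select the mixture order. Purely as an illustration of the idea, the sketch below uses a hypothetical rule (one Gaussian per fixed number of training occurrences, clipped to a maximum); the thresholds, counts, and triphone labels are all invented.

```python
def mixture_order(count, tokens_per_gaussian=20, max_order=8):
    """Pick the number of Gaussians for a triphone from its training count.

    Hypothetical rule: one Gaussian per `tokens_per_gaussian` occurrences,
    with at least 1 and at most `max_order` components.
    """
    return max(1, min(max_order, count // tokens_per_gaussian))

counts = {"r-iy+w": 7, "th-r+iy": 65, "ax-n+d": 400}  # made-up counts
orders = {tri: mixture_order(n) for tri, n in counts.items()}
print(orders)
```

Rarely seen triphones fall back to a single Gaussian, while frequent ones are allowed richer densities.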

Finally,  a  version  of  tied  mixtures  was  tested  and  shown  to  provide  a  small  improvement  for 
the  speaker-independent  task.  This  technique  shares  Gaussians  among  different  phones  (different 
weights are used for each phone) and, hence, reduces the number of Gaussians. It provides another
form  of  matching  model  complexity  to  the  amount  of  training  data  by  allowing  the  system  to 
automatically  reduce  the  number  of  degrees  of  freedom  in  the  model  when  there  is  insufficient 
training  data. 
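
The sharing described above can be sketched as a single pool of Gaussians with one weight vector per phone. The one-dimensional densities and all parameter values below are invented for illustration and are not taken from the Lincoln system.

```python
import math

def gaussian(x, mean, var):
    """One-dimensional Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

# One pool of Gaussians shared by every phone (values are made up).
SHARED = [(-2.0, 1.0), (0.0, 1.0), (3.0, 2.0)]  # (mean, variance)

# Each phone keeps only its own mixture weights over the shared pool.
WEIGHTS = {
    "iy": [0.7, 0.3, 0.0],
    "aa": [0.0, 0.4, 0.6],
}

def tied_mixture_density(phone, x):
    """Density of x for `phone`: phone-specific weights, shared Gaussians."""
    # The Gaussian values depend only on x, so they can be computed once
    # per frame and reused for every phone.
    shared_values = [gaussian(x, m, v) for m, v in SHARED]
    return sum(w * g for w, g in zip(WEIGHTS[phone], shared_values))

print(tied_mixture_density("iy", -2.0), tied_mixture_density("aa", -2.0))
```

A weight that training drives to zero removes that Gaussian from the phone's density, which is one way the model can reduce its own degrees of freedom when data are scarce.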

As a sample of results obtained with the Lincoln continuous speech recognition system on the
DARPA Resource Management data base, Figure 9-1 presents results obtained on the February
1989 evaluation test set. A full set of tests was run for both speaker-dependent and speaker-independent
recognition, and results are similar to those of the other high performing DARPA-supported
groups. The word error results, with grammar, of 3.7 percent (speaker-dependent) and
9.8  percent  (speaker-independent),  represent  state-of-the-art  performance,  but  performance  needs 
to  be  improved  substantially  to  achieve  acceptable  sentence  accuracy.  More  work  in  basic  speech 
recognition,  as  well  as  in  application  of  syntax  and  semantics,  is  needed  and  is  in  progress  at  Lincoln 
and  at  numerous  other  laboratories. 
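
The gap between word accuracy and sentence accuracy can be estimated roughly. The sketch below assumes, for illustration only, that word errors are independent and that every sentence has the average length of eight words; real errors are correlated, and word error rate also counts insertions and deletions, so the figures are indicative at best.

```python
def sentence_accuracy(word_error_rate, words_per_sentence):
    """Estimated probability that every word in a sentence is correct.

    Assumes (for illustration only) that word errors are independent.
    """
    return (1.0 - word_error_rate) ** words_per_sentence

for label, wer in [("speaker-dependent", 0.037), ("speaker-independent", 0.098)]:
    acc = sentence_accuracy(wer, 8)
    print(f"{label}: about {100 * acc:.0f}% of 8-word sentences fully correct")
```

Under these assumptions roughly three quarters of speaker-dependent sentences and fewer than half of speaker-independent sentences would be recognized without any error.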

The  results  in  Figure  9-1  illustrate  the  substantial  impact  of  the  grammar  perplexity,  as  well 
as the comparison between speaker-dependent and speaker-independent training. In addition, comparisons
are provided of speaker-independent performance of the same system trained both on 72
speakers (SI-72) and on 109 speakers (SI-109), with better results occurring for the better-trained
system.  It  is  probable  that  error  rate  could  be  further  reduced  by  simply  increasing  the  amount  of 
training  data. 




[Table: word error rates (percent); columns: word-pair grammar (perplexity = 60), no grammar (perplexity = 991).]

Figure 9-1. Word error rates for the Lincoln HMM recognizers on the February 1989
"official tests" on the DARPA Resource Management data base.


10. INTEGRATION OF NEURAL NETWORKS AND HIDDEN MARKOV
MODELS


An  important  area  being  explored  for  the  improvement  of  speech  recognition  performance  is  the 
development  and  application  of  neural  network  classification  algorithms  and  architectures.  Neural 
nets  offer  the  potential  of  new  algorithms,  dynamic  adaptation,  and  computational  efficiency,  and 
have  been  the  subject  of  intensive  efforts  at  a  number  of  laboratories.  Work  at  Lincoln  in  neural 
nets  for  speech  recognition  began  in  1987,  with  a  major  focus  on  comparison  of  neural  net  and 
conventional pattern classification algorithms [20]-[23], and on efficient neural net implementations
of conventional recognizers, including a neural net implementation of the Viterbi decoder used in
HMM [23].

Speech  recognition  using  neural  networks  has  so  far  led  to  good  results  only  for  small-vocabulary 
tasks that involve low-level units of speech, such as phonemes and letters in the “E-set” [20]. Such
problems  take  advantage  of  the  relatively  well-developed  status  of  static  neural  net  classifiers,  and 
of their capability for discrimination between similar patterns. However, research on neural net dynamic
pattern classifiers, which are required for more difficult isolated-word and continuous speech
tasks,  is  only  at  a  beginning  stage  and  has  met  with  limited  success. 

Our  current  approach  is  to  take  advantage  of  both  neural  network  (NN)  classifiers  and  the 
HMM framework (which handles the time dimension in dynamic pattern classification) by combining
the two techniques. Two approaches [25] are being pursued in developing combined HMM/NN
recognizers. One approach applies a multi-layer perceptron (MLP) as a second-stage discriminator
designed to overcome HMM’s weakness in focusing on segments of the input speech that are important
for discrimination. The motivation of this two-stage scheme is similar to earlier discrimination
work described in [9,10], but takes advantage of the automatic discrimination training capability
of the back-propagation algorithm. This two-stage HMM/NN approach has yielded improved
recognition  accuracy  for  small  vocabulary  (“B,”  “D,”  “G”)  tasks,  but  scaling  problems  have  been 
encountered  for  larger  vocabularies.  Various  approaches  are  being  investigated  for  overcoming  these 
problems  with  larger  vocabularies. 

A  second  integration  strategy,  which  is  a  primary  focus  of  our  current  work,  is  to  use  a  neural 
net  for  acoustic/phonetic  feature  extraction  to  provide  local  distance  scores  as  shown  in  Figure  10-1 
[25]. Other workers [48,49] have obtained promising results using similar approaches. Our approach
is distinguished from others in that we are using fully-automated training, without need for hand-segmenting
training data. The training algorithms being explored integrate back-propagation and
other  neural  network  algorithms  into  the  iterative  forward-backward  training  algorithm.  Although 
training  is  very  computationally  intensive,  recognition  is  not  much  more  complex  than  a  standard 
HMM  system.  The  promise  of  the  hybrid  system  shown  in  Figure  10-1  is  that  it  uses  the  neural 
network to extract features while using the HMM for time alignment. Currently, we have implemented
a basic hybrid system and have obtained initial E-set results which are similar to existing
high performance HMM systems. However, we have not yet exploited frame context, as illustrated
in Figure 10-1, and also are introducing a number of refinements into the training and recognition
algorithms. 
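
The division of labor in the hybrid system, with the network supplying local scores and the HMM handling time alignment, can be sketched with a minimal Viterbi alignment. The score table stands in for neural network outputs; the left-to-right topology without transition probabilities and all names are simplifying assumptions, not the Lincoln implementation.

```python
import math

def viterbi_align(frame_scores):
    """Align T frames to S left-to-right states.

    frame_scores[t][s] is a log score for frame t under state s; in the
    hybrid system this would come from the neural network, here it is
    just a table. Each state may repeat or advance to the next state.
    Returns (best total score, state index for each frame).
    """
    T, S = len(frame_scores), len(frame_scores[0])
    NEG = -math.inf
    best = [[NEG] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    best[0][0] = frame_scores[0][0]            # must start in state 0
    for t in range(1, T):
        for s in range(S):
            stay = best[t - 1][s]
            move = best[t - 1][s - 1] if s > 0 else NEG
            prev = s if stay >= move else s - 1
            best[t][s] = max(stay, move) + frame_scores[t][s]
            back[t][s] = prev
    # Trace back from the final state, which the path must end in.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[T - 1][S - 1], path

# Made-up local scores for 4 frames and 2 states.
scores = [[0.0, -5.0], [-1.0, -4.0], [-6.0, -0.5], [-7.0, -0.2]]
print(viterbi_align(scores))
```

Because the decoder only consumes a table of local scores, the cost of recognition stays close to that of a standard HMM once the network outputs are available.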

Figure 10-1. System framework for integration of neural network classifiers with
HMM recognition, where the neural net is used for acoustic-phonetic feature extraction
over multiple input speech frames, and the HMM is used for time-alignment and
temporal decoding.

11. CONCLUSIONS AND FUTURE WORK


The  work  described  in  this  report  began  with  a  focus  on  robust  isolated-word  recognition 
in  stressed,  noisy  environments  typical  of  the  aircraft  environment.  Enhancements  to  a  baseline 
HMM  system  were  developed  which  significantly  improved  recognition  performance  under  difficult 
conditions  and  which  also  improved  performance  for  normal  speech.  Later,  the  robust  HMM 
system was extended successfully to continuous speech recognition under stress, and then to large-vocabulary
continuous speech recognition for normal speech. In addition to the HMM work, Lincoln
has  successfully  developed  neural  net  algorithms  and  architectures  for  speech  recognition,  and 
current  work  includes  hybrid  HMM/NN  systems. 

Current and projected future areas of Lincoln research in speech recognition include (1) development
of robust techniques for talker-independent recognition and key word spotting [19] on noisy
and distorted speech; (2) research into the application of speaker recognition techniques to improve
speech recognition performance; (3) continued development and improvement of HMM-based continuous
speech recognition techniques, including tied mixtures and extensions beyond the basic
HMM structure; (4) development of structures, including a new interface specification based on a
stack  controller  [18],  for  integrating  speech  recognition  and  natural  language  systems  into  spoken 
language  systems;  and  (5)  continued  research  on  neural  network  techniques,  including  improved 
neural  network  classifiers  and  hybrid  HMM/NN  systems. 